library(ggplot2)
library(dplyr)
library(ggmap)
library(tidyr)
library(cluster)
library(lubridate)
library(rgl)
library(maptools)
library(ggpubr)
library(viridis)
library(tidyverse)
library(scatterplot3d)
library(factoextra)
library(fpc)
library(NbClust)
library(reshape2)
library(scales)
library(plyr)
library(caret)
library(ROCR)
library(pROC)
library(rlist)
library(Hmisc)
library(corrplot)
library(ggcorrplot)
library(DMwR)
library(ggthemes)
library(e1071)
library(maps)
library(RColorBrewer)

model_dir = "models"
data_dir = "data"
map_dir = "map"
saved_maps = list.files(map_dir)
### Loading data
data=read.csv(paste(data_dir,"MCI_2014_to_2017.csv",sep="/"), header = TRUE, sep = ",")
### Loading map
for(file in saved_maps) {
  load(paste(map_dir,file,sep="/"))
}

I. Introduction

The second assignment for the York Machine Learning course, Machine Learning in Business Context, was to explore unsupervised machine learning algorithms, specifically clustering. We chose a dataset from the Toronto Police Service Public Safety Data Portal, MCI 2014 to 2017. This report follows the structure laid out in the CRISP-DM methodology.

The GitHub repository for all the source code is located at the following link: link here.

The RShiny app is located at the following link: link here.

II. Business Understanding

The Toronto Police Service is the police force that serves the City of Toronto. It is the largest municipal police force in Canada. It is a taxpayer-funded service, ranking second among the City of Toronto's budgetary expenses.

The objective of this model is to cluster crimes to determine which areas of Toronto have the highest levels of crime, overall and for each Major Crime Indicator. This would enable the Toronto Police to allocate their officers and specialists most effectively to the areas that need them most. The hope is that this would lower crime rates and enable more cases to be solved, all without spending more money.

There are some ethical implications of using crime data, as there are many avenues for bias to enter the dataset. Police services across North America have come under increased scrutiny in recent years for racist policing, and some police policies or laws inherently disadvantage certain groups of people, which creates bias in the data. Conclusions drawn from a machine learning model trained on biased data will themselves be biased. They should therefore be weighed against other data, such as demographic data, and supplementary information, such as social considerations.

III. Data Understanding

The data set was provided courtesy of the Toronto Police Service Open Data Portal. It is usable in accordance with the Open Government License - Ontario.

The data concerns all Major Crime Indicators (MCI) occurrences in Toronto by reported date, between 2014 and 2017. The MCIs are Assault, Break and Enter, Auto Theft, and Theft Over (excludes Sexual Assaults). Locations in the data set are approximate, as they have been deliberately offset to the nearest road intersection to protect the privacy of involved individuals.

There are 29 columns with 131,073 observations. However, 4 of these columns are duplicates, leaving 25 usable columns. The attributes fall into three groups: date/time data, crime-type data, and location data.
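As a quick sanity check, duplicated columns can be detected mechanically by comparing column contents rather than names. The following is a minimal sketch on a toy data frame mimicking the X/Y vs. Long/Lat overlap (the toy values are ours, not from the dataset):

```r
# Toy data frame: X/Y duplicate Long/Lat, as in the MCI dataset
toy <- data.frame(
  X    = c(-79.51, -79.44),
  Y    = c(43.64, 43.69),
  Long = c(-79.51, -79.44),  # same contents as X
  Lat  = c(43.64, 43.69)     # same contents as Y
)
# duplicated() on the list of columns flags any column whose
# contents match an earlier column, regardless of its name
dup_cols <- duplicated(as.list(toy))
names(toy)[dup_cols]  # "Long" "Lat"
```

This flags Long and Lat as duplicates of X and Y, which is how the duplicate pairs were identified here.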

The following table shows the date/time attributes:

Attribute Description
Occurrence Date Date of occurrence
Occurrence Year Year of occurrence
Occurrence Month Month of occurrence
Occurrence Day Day of occurrence
Occurrence Day of Year Day of year of occurrence
Occurrence Day of Week Day of week of occurrence
Occurrence Hour Hour of occurrence
Reported Date Date of report
Reported Year Year of report
Reported Month Month of report
Reported Day Day of report
Reported Day of Year Day of year of report
Reported Day of Week Day of week of report
Reported Hour Hour of report

While this table shows the categorical attributes:

Attribute Description
Premise Type The type of premise (outside, house, commercial, etc.) where the crime happened
Major Crime Indicator The major crime indicator (type of crime)
Offence The specific offence that happened
UCR Code Uniform Crime Reporting Code - categorizes the crime
UCR Extension Uniform Crime Reporting Extension - further categorizes the crime

This table shows location attributes:

Attribute Description
Division The Toronto Police Division
Neighbourhood Toronto Neighbourhood
Hood ID Toronto neighbourhood ID
Lat Latitude
Long Longitude

IV. Data Exploration and Preparation

(2) Visualizations for the division and neighbourhood attributes

Visualizing crime levels for all neighbourhoods.

shpfile <- paste(data_dir,"NEIGHBORHOODS_WGS84_2.shp",sep="/")
sh <- readShapePoly(shpfile)
## Warning: readShapePoly is deprecated; use rgdal::readOGR or sf::st_read
sh@data$AREA_S_CD <- as.integer(sh@data$AREA_S_CD)
total_offence_cnt_table = data %>% group_by(Hood_ID) %>% dplyr::summarise(offence_cnt = n())
hood_total_offence_cnt_table = merge(total_offence_cnt_table,sh@data,by.x='Hood_ID',by.y='AREA_S_CD')
points_offense_cnt <- fortify(sh, region = 'AREA_S_CD')
points_offense_cnt <- merge(points_offense_cnt, hood_total_offence_cnt_table, by.x='id', by.y='Hood_ID', all.x=TRUE)
torontoMap + geom_polygon(aes(x=long,y=lat, group=group, fill=offence_cnt), data=points_offense_cnt, color='black') +
  scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))

Visualizing the Toronto crime heatmap across neighbourhoods.

base_size <- 9
heat_group <- group_by(toronto, Neighbourhood, offence)
heat_count <- dplyr::summarise(heat_group, Total = n())
heat_count$Neighbourhood <- with(heat_count, reorder(Neighbourhood, Total))
heat_count.m <- melt(heat_count)
## Using Neighbourhood, offence as id variables
heat_count.m <- ddply(heat_count.m, .(variable), transform, rescale = rescale(value))
ggplot(heat_count.m, aes(Neighbourhood, offence)) + 
    geom_tile(aes(fill = rescale), colour = "white") + 
    ggtitle("Toronto Criminal Heatmap") +
    scale_fill_gradient(low = "lightblue", high = "darkblue") +
    theme_grey(base_size = base_size) + 
    labs(x = "", y = "") + 
    scale_x_discrete(expand = c(0, 0)) +
    scale_y_discrete(expand = c(0, 0)) +
    theme_minimal() +
    theme(legend.position = "none",
          axis.ticks = element_blank(),
          axis.text.x = element_text(size = base_size * 0.5, angle = 270, hjust = 0, colour = "grey50"),
          axis.text.y = element_text(size = base_size * 0.5))

We can also visualize other interesting information from the dataset, e.g. finding the most dangerous or safest zones or neighbourhoods in Toronto, based on all MCIs or on a specific MCI. The following charts display the top 5 dangerous and top 5 safe neighbourhoods in Toronto by total MCI count. For more options on other top neighbourhoods/divisions, by total MCI or for a specific MCI, please explore our application.

location_group <- group_by(toronto, Neighbourhood)
crime_by_location <- dplyr::summarise(location_group, n=n())
crime_dangerous <- crime_by_location[order(crime_by_location$n, decreasing = TRUE), ]
crime_dangerous_top <- head(crime_dangerous, 5)
ggplot(aes(x = reorder(Neighbourhood, n), y = n, fill = reorder(Neighbourhood, n)), data = crime_dangerous_top) +
          geom_bar(stat = 'identity', width = 0.6) +
          ggtitle("Top 5 dangerous neighbourhoods in Toronto") +
          coord_flip() +
          xlab('Zone') +
          ylab('Number of Occurrences') +
          scale_fill_brewer(palette = "Reds") +
          theme(plot.title = element_text(size = 16),
                axis.title = element_text(size = 12, face = "bold"),
                legend.position="none")

crime_safe <-  crime_by_location[order(crime_by_location$n, decreasing = FALSE), ]
crime_safe_top <- head(crime_safe, 5)
ggplot(aes(x = reorder(Neighbourhood, -n), y = n, fill = reorder(Neighbourhood, -n)), data = crime_safe_top) +
        geom_bar(stat = 'identity', width = 0.6) +
        ggtitle("Top 5 safe neighbourhoods in Toronto") +
        coord_flip() +
        xlab('Zone') +
        ylab('Number of Occurrences') +
        scale_fill_brewer(palette = "Greens") +
        theme(plot.title = element_text(size = 16),
              axis.title = element_text(size = 12, face = "bold"),
              legend.position="none")

(3) Data outliers

It seems that some crimes occurred over 40 years before they were reported. But since nothing stops people from reporting old crimes, we are not going to treat these as outliers.

data$occurrencedate <- ymd(gsub("(.*)T.*", "\\1", data$occurrencedate))
data$reporteddate <- ymd(gsub("(.*)T.*", "\\1", data$reporteddate))
data[which(data$occurrencedate < as.Date("1970-01-01")),]
##               X        Y Index_ event_unique_id occurrencedate
## 1257  -79.53082 43.64304    147   GO-2015636294     1966-06-09
## 11889 -79.44592 43.68811   5156  GO-20143212768     1968-01-01
##       reporteddate premisetype ucr_code ucr_ext offence reportedyear
## 1257    2015-04-17   Apartment     1430     100 Assault         2015
## 11889   2014-10-31       House     1430     100 Assault         2014
##       reportedmonth reportedday reporteddayofyear reporteddayofweek
## 1257          April          17               107        Friday    
## 11889       October          31               304        Friday    
##       reportedhour occurrenceyear occurrencemonth occurrenceday
## 1257            15             NA                            NA
## 11889           11             NA                            NA
##       occurrencedayofyear occurrencedayofweek occurrencehour     MCI
## 1257                   NA                                  0 Assault
## 11889                  NA                                  9 Assault
##       Division Hood_ID                   Neighbourhood      Lat      Long
## 1257       D22      14 Islington-City Centre West (14) 43.64304 -79.53082
## 11889      D13     107           Oakwood Village (107) 43.68811 -79.44592
##        FID
## 1257   257
## 11889 5889

The rest of the data is mostly categorical in nature, so no other outlier concerns need to be addressed here.

(4) Data Preprocessing

The following code documents the preprocessing steps identified for this dataset.

#reload the data
data <- read.csv(paste(data_dir,"MCI_2014_to_2017.csv",sep="/"), header = TRUE, sep = ",")

First, we want to remove duplicate data, both rows and columns. Some events have duplicated event IDs and should be removed. We also have duplicate column pairs for X/Y and Lat/Long, which should be dropped. We don't use the UCR codes or the ID numbers, so they are also removed.

#remove duplicate rows/entries
data <- subset(data, !duplicated(data$event_unique_id))

#remove columns that aren't used/duplicates
data <- data[, !colnames(data) %in% c("?..X","Y","Index_","event_unique_id","ucr_code","ucr_ext","FID")]

Next, we format the dates. There are garbage time values at the end of the date strings, which we remove. The remaining date values are converted into appropriate date/time types. Whitespace is also present in the day-of-week columns, so it is trimmed.

#formatting dates - remove garbage time values at the end
data$occurrencedate <- gsub("(.*)T.*", "\\1", data$occurrencedate)
data$reporteddate <- gsub("(.*)T.*", "\\1", data$reporteddate)
data$occurrencetime = ymd_h(paste(data$occurrencedate,data$occurrencehour,sep = " "), tz="EST")
data$reportedtime = ymd_h(paste(data$reporteddate,data$reportedhour,sep = " "), tz="EST")
data$occurrencedate = ymd(data$occurrencedate)
data$reporteddate = ymd(data$reporteddate)

#removing whitespace from day of week
data$occurrencedayofweek <- as.factor(trimws(data$occurrencedayofweek, "b"))
data$reporteddayofweek <- as.factor(trimws(data$reporteddayofweek, "b"))

We also convert the month/day of week from string representation to integer representation:

data$reportedmonth = month(data$reportedtime)
data$reporteddayofweek = wday(data$reportedtime)
data$occurrencemonth = month(data$occurrencetime)
data$occurrencedayofweek = wday(data$occurrencetime)

Now let’s take a look at the missing data:

#missing data
colSums(is.na(data))
##                   X      occurrencedate        reporteddate 
##                   0                   0                   0 
##         premisetype             offence        reportedyear 
##                   0                   0                   0 
##       reportedmonth         reportedday   reporteddayofyear 
##                   0                   0                   0 
##   reporteddayofweek        reportedhour      occurrenceyear 
##                   0                   0                  32 
##     occurrencemonth       occurrenceday occurrencedayofyear 
##                   0                  32                  32 
## occurrencedayofweek      occurrencehour                 MCI 
##                   0                   0                   0 
##            Division             Hood_ID       Neighbourhood 
##                   0                   0                   0 
##                 Lat                Long      occurrencetime 
##                   0                   0                   0 
##        reportedtime 
##                   0
NAdata <- unique (unlist (lapply (data, function (x) which (is.na (x)))))

Rows with NA data:

NAdata
##  [1]   921   922   923   964   986  1036  1037  1086  1087  1088  1157
## [12]  1158  1159  1211  1228  1508  1509  1510  1689  1690  1691  1692
## [23]  3330  3331 10372 10373 13858 13859 13860 13861 95190 98356

Occurrence dates for rows with NA data:

data$occurrencedate[NAdata]
##  [1] "1989-01-01" "1998-01-01" "1991-01-01" "1998-01-01" "1999-10-01"
##  [6] "1998-01-01" "1996-01-01" "1966-06-09" "1980-04-24" "1970-01-01"
## [11] "1999-01-01" "1996-01-01" "1992-12-21" "1996-01-31" "1998-06-01"
## [16] "1974-01-01" "1995-01-01" "1990-01-01" "1988-01-01" "1988-01-01"
## [21] "1973-08-31" "1999-03-24" "1999-01-01" "1982-03-13" "1968-01-01"
## [26] "1995-01-01" "1994-01-01" "1989-01-01" "1987-01-01" "1989-11-20"
## [31] "1996-01-01" "1978-04-10"

We can see that there are 32 missing values, all in occurrence-date-related columns. Upon further inspection, they all belong to the same rows. We can also see that the occurrence date value itself is correct in those rows, so the derived date columns can have their missing values imputed from it.

#imputing occurrence date components from the occurrence date field
data$occurrenceyear[NAdata] <- year(data$occurrencedate[NAdata])
data$occurrencemonth[NAdata] <- month(data$occurrencedate[NAdata], label = TRUE, abbr = FALSE)
data$occurrenceday[NAdata] <- day(data$occurrencedate[NAdata])
data$occurrencedayofweek[NAdata] <- wday(data$occurrencedate[NAdata], label = TRUE, abbr = FALSE)
data$occurrencedayofyear[NAdata] <- yday(data$occurrencedate[NAdata])

Now we replace the space in the strings with an underscore for easier processing:

#replace space in string
data$offence <- gsub("\\s", "_", data$offence)
data$MCI <- gsub("\\s", "_", data$MCI)

Next, all columns are converted into factors except Lat, Long, reportedtime, and occurrencetime. Unused factor levels (resulting from the missing values) are also dropped.

#change things to factors
for(col in c("offence","MCI","Division","Hood_ID")) {
  data[,col] = as.factor(data[,col])
}

#if we used Gower distance with the daisy() function, the following columns
#could be converted to ordered factors; but since Gower distance turns out to be
#too expensive for a large dataset, we treat them as unordered factors as well
for(col in c("reportedyear","reportedmonth","reportedday","reporteddayofyear","reporteddayofweek",
             "reportedhour","occurrenceyear","occurrencemonth","occurrenceday","occurrencedayofyear",
             "occurrencedayofweek","occurrencehour")) {
  data[,col] = factor(data[,col])
}

#drop unused factor levels
for(col in names(data)) {
  if(is.factor(data[,col])) {
    data[,col] = droplevels(data[,col])
  }
}

Finally, we check for missing data one last time:

# Check missing values one last time
colSums(is.na(data))
##                   X      occurrencedate        reporteddate 
##                   0                   0                   0 
##         premisetype             offence        reportedyear 
##                   0                   0                   0 
##       reportedmonth         reportedday   reporteddayofyear 
##                   0                   0                   0 
##   reporteddayofweek        reportedhour      occurrenceyear 
##                   0                   0                   0 
##     occurrencemonth       occurrenceday occurrencedayofyear 
##                   0                   0                   0 
## occurrencedayofweek      occurrencehour                 MCI 
##                   0                   0                   0 
##            Division             Hood_ID       Neighbourhood 
##                   0                   0                   0 
##                 Lat                Long      occurrencetime 
##                   0                   0                   0 
##        reportedtime 
##                   0

V. Modeling

With the data prepared, we can now start looking at models.

(1) Clustering strategy I.

First, we can look at clustering crime by neighborhood. We need to coerce the data into a clusterable table, grouped by MCI and neighborhood.

# Neighbourhood first
#first, coerce the data into a table that can be clustered - we aren't interested in the occurrence date at this point
#courtesy of Susan Li - https://datascienceplus.com/exploring-clustering-and-mapping-torontos-crimes/
bygroup <- group_by(data, MCI, Hood_ID)
groups <- dplyr::summarise(bygroup, n=n())
groups <- groups[c("Hood_ID", "MCI", "n")]
hood <- as.data.frame(spread(groups, key=MCI, value=n))
hood_id = as.integer(hood[,"Hood_ID"])
hood = hood[,-1]
head(hood)
##   Assault Auto_Theft Break_and_Enter Robbery Theft_Over
## 1     925       1091             501     243        187
## 2     867        203             132     302         14
## 3     203         59              84      73         10
## 4     233         74              61      57          4
## 5     206         53              56      66          4
## 6     389        175             144      73         19

Then we can use this table to perform k-means clustering. First, we need to normalize the data and determine the number of clusters. To choose the number of clusters, we plot the within-cluster sum of squares against the number of clusters and pick the value at the elbow of the plot.

hood <- scale(hood)
#determine number of clusters
wssplot <- function(data, nc=15, seed=1234) {
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}

#we can see there's an elbow around 3 clusters
wssplot(hood, nc=15)

It now seems that 3 is a good choice for the number of clusters, so we proceed to perform k-means clustering on the dataset. We followed this tutorial (https://www.datanovia.com/en/lessons/cluster-validation-statistics-must-know-methods/) to evaluate the quality of the clustering results. Specifically, we use silhouette width as a quantitative measure of clustering quality. As the silhouette plot will show, both cluster #2 and cluster #3 contain data points whose silhouette width falls below zero, indicating a less than ideal clustering. This can also be observed in the first plot, where clusters #2 and #3 sit quite close to each other.

# K-means clustering
km.res <- eclust(hood, "kmeans", k = 3, graph = T)

fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic())
##   cluster size ave.sil.width
## 1       1   10          0.01
## 2       2   37          0.21
## 3       3   93          0.59
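To make the silhouette measure concrete: for each point i, s(i) = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance to the other points in its own cluster and b_i is the mean distance to the nearest other cluster; a negative width means the point is, on average, closer to another cluster than to its own. A minimal sketch on made-up 1-D data (the values are ours, chosen so that one point is misassigned):

```r
library(cluster)

# Two tight groups plus one stray point forced into the wrong cluster
x  <- c(1, 1.2, 1.1, 8, 8.3, 5.5)
cl <- c(1, 1,   1,   2, 2,   1)   # point 6 assigned to cluster 1

sil <- silhouette(cl, dist(x))
# Point 6 is closer to cluster 2 on average, so its silhouette
# width is negative -- the same pattern visible in our plot above
sil[6, "sil_width"] < 0
```

This is the behaviour behind the negative bars in clusters #2 and #3: those neighbourhoods sit closer to a different cluster centre than to their own.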

Let’s look at the cluster means from the outputs to see what we can learn.

# k-means results
km.res
## K-means clustering with 3 clusters of sizes 10, 37, 93
## 
## Cluster means:
##      Assault Auto_Theft Break_and_Enter    Robbery Theft_Over
## 1  2.5319589  1.6064456       2.5015651  2.3608452  2.9465633
## 2  0.5229260  0.3375853       0.6012997  0.5174595  0.2914680
## 3 -0.4802995 -0.3070442      -0.5082122 -0.4597253 -0.4327952
## 
## Clustering vector:
##   [1] 1 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 2 3 3 3 2 3 3 2 2 2 1 3 3 3 2 3 3 3 3
##  [36] 3 3 3 2 2 3 3 3 3 2 3 2 3 3 2 2 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2
##  [71] 3 3 1 2 1 1 1 1 2 3 3 2 3 3 3 2 3 3 3 3 3 3 2 3 1 3 3 2 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 2 3 2 3 3 3 2 3 2 2 2 2 3 2 3 2 2 2 3 2 2 2 3 3 3 2 1 2 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 153.40984  68.48510  45.97407
##  (between_SS / total_SS =  61.5 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "clust_plot"   "silinfo"      "nbclust"     
## [13] "data"

We can see that cluster 3 has lower than average crime. It contains 93 neighborhoods, meaning the majority of Toronto is quite safe in terms of the number of incidents. Cluster 2 has slightly above average crime counts, with only 37 neighborhoods. The remaining 10 neighborhoods, in cluster 1, have much higher than average crime counts. In other words, crime is concentrated in a few small pockets of Toronto.

As an interesting exercise, we also look at a 3D representation of the clustering results. With 3 principal components, more variation is captured than with 2. With a third dimension, we can see that there is more spread than is apparent in two dimensions alone.

#3D plot
pc <-princomp(hood, cor=TRUE, scores=TRUE)
plot3d(pc$scores[,1:3], col=km.res$cluster, main="k-means clusters")

Next, we can look at the neighborhoods from a hierarchical clustering approach. Again, we need to determine how many clusters to use. For hierarchical clustering, we look at how the observations are distributed across clusters under different configurations (here, the configuration is the number of clusters).

counts <- sapply(2:6, function(ncl) eclust(hood, "hclust", k = ncl, hc_metric = "euclidean", hc_method = "ward.D2")$size)
names(counts) <- 2:6
counts
## $`2`
## [1]  10 130
## 
## $`3`
## [1] 10 44 86
## 
## $`4`
## [1]  1 44 86  9
## 
## $`5`
## [1]  1 44 86  7  2
## 
## $`6`
## [1]  1 15 86  7 29  2

Still, we can see that 3 clusters is a reasonable choice if we consider only cluster size as the criterion: we don't want to split the neighborhoods into clusters containing only a handful of neighborhoods each.

Now that we have chosen the number of clusters, we proceed with the clustering. As with k-means, we look at silhouette width as a quantitative measure of clustering quality. To visually inspect the clusters, we plot dendrograms.

# Hierarchical clustering
hc.res <- eclust(hood, "hclust", k = 3, hc_metric = "euclidean", 
                 hc_method = "ward.D2")

# Visualize dendrograms
fviz_dend(hc.res, show_labels = FALSE,
         palette = "jco", as.ggplot = TRUE)

fviz_silhouette(hc.res, palette = "jco", ggtheme = theme_classic())
##   cluster size ave.sil.width
## 1       1   10          0.04
## 2       2   44          0.16
## 3       3   86          0.61

It seems we have reached the same conclusion as with k-means: clusters #1 and #2 do not have a large gap between them. The following code plots the clustering results on the map. We also plot the offence count in each neighbourhood; the second plot is highly correlated with the first (which makes sense).

cluster_ids <- km.res$cluster
hood_ids_and_cluster_ids <- data.frame(cbind(hood_id,cluster_ids))
hood_ids_and_cluster_ids$cluster_ids = as.factor(hood_ids_and_cluster_ids$cluster_ids)
shpfile <- paste(data_dir,"NEIGHBORHOODS_WGS84_2.shp",sep="/")
sh <- readShapePoly(shpfile)
## Warning: readShapePoly is deprecated; use rgdal::readOGR or sf::st_read
sh@data$AREA_S_CD <- as.integer(sh@data$AREA_S_CD)

hood_name_and_cluster_ids = merge(hood_ids_and_cluster_ids,sh@data,by.x='hood_id',by.y='AREA_S_CD')
points_clustering <- fortify(sh, region = 'AREA_S_CD')
points_clustering <- merge(points_clustering, hood_name_and_cluster_ids, by.x='id', by.y='hood_id', all.x=TRUE)

p1 = torontoMap + geom_polygon(aes(x=long,y=lat, group=group, fill=cluster_ids), data=points_clustering, color='black') +
  scale_fill_brewer(palette = "Set2")
p2 = torontoMap + geom_polygon(aes(x=long,y=lat, group=group, fill=offence_cnt), data=points_offense_cnt, color='black') +
  scale_fill_distiller(palette='Spectral') + scale_alpha(range=c(0.5,0.5))
p3 = ggarrange(p1, p2, ncol = 2, nrow = 1, common.legend = F)
print(p3)

(2) Clustering strategy II.

We can perform clustering on lat/long to look for natural crime hotspots. We can then compare this to a manual clustering like the Toronto police divisions.

latlong <- data[, colnames(data) %in% c("Lat", "Long")]

# k-means 
# 34 divisions in Toronto
k.means.fit <- kmeans(latlong, 34)
torclus <- as.data.frame(k.means.fit$centers)
torclus$size <- k.means.fit$size
latlongclus <- latlong
latlongclus$cluster <- as.factor(k.means.fit$cluster)
tormap <- get_map(location =c(left=-79.8129, bottom=43.4544, right=-78.9011, top=43.9132))
## Warning: bounding box given to google - spatial extent only approximate.
## converting bounding box to center/zoom specification. (experimental)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=43.6838,-79.357&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN
#plot each incident, color-coded by cluster
ggmap(tormap) +
  geom_point(data = latlongclus, aes(x = Long, y = Lat, color = as.factor(cluster))) +
  theme_void() + coord_map() 
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

#plot bubble map with cluster centroids, size/ color determined by the number of incidents in each cluster
ggmap(tormap) +
  geom_point(data = torclus, aes(x = Long, y = Lat, size = size)) +
  theme_void() + coord_map() 
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

We can see how the clusters look with both the incidents plotted as well as the centroids. Next, we can perform k-means clustering on the natural crime hotspot clusters and on the Toronto Police divisions to see how they compare to one another.

#coerce data for Hotspots
data2 <- data
data2$cluster <- k.means.fit$cluster
bygroup <- group_by(data2, MCI, cluster)
groups <- dplyr::summarise(bygroup, n=n())
groups <- groups[c("cluster", "MCI", "n")]
hotspot <- data.frame(spread(groups, key=MCI, value=n))
hotspot <- hotspot[, -1]
hotspot = data.frame(scale(hotspot))
#determine number of clusters
#we can see there's an elbow around 4 clusters
wssplot(hotspot, nc=15)

We can see here that there is an elbow around 4, so we run k-means with 4 clusters. Reading the cluster means below: cluster 2 has much higher than average auto thefts and above-average robberies; cluster 3 has much higher than average assaults, break and enters, robberies, and theft overs, but below-average auto thefts; cluster 4 is a safe part of Toronto, below average across all five MCIs; and cluster 1 sits close to the average. Luckily, only 4 hotspots fall into each of the high-crime clusters 2 and 3, while 16 and 10 hotspots fall into the lower-crime clusters 1 and 4 respectively.

From the silhouette plot, we can see that clustering on this dataset works slightly better than in clustering strategy I. When we plot the clusters using principal component analysis in 2D, the clusters are relatively well separated. Clusters 1 and 4 are spread out, while clusters 2 and 3 are tighter. On the map, we can also see that cluster 4 is concentrated in downtown Toronto, while cluster 1 lies in the northwest.

km.res <- eclust(hotspot, "kmeans", k = 4, graph = T)

fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic())
##   cluster size ave.sil.width
## 1       1   16          0.22
## 2       2    4          0.22
## 3       3    4          0.24
## 4       4   10          0.46

km.res
## K-means clustering with 4 clusters of sizes 16, 4, 4, 10
## 
## Cluster means:
##       Assault Auto_Theft Break_and_Enter     Robbery Theft_Over
## 1 -0.03088279 -0.2256176       0.3005858  0.02643693 -0.3488020
## 2  0.55292426  2.4255005      -0.5324398  1.12469735  0.5700703
## 3  1.72830722 -0.6195701       1.6301725  1.22286588  2.0263084
## 4 -0.86308013 -0.3613840      -0.9200304 -0.98132438 -0.4804683
## 
## Clustering vector:
##  [1] 4 1 1 4 1 1 4 3 3 4 4 1 1 1 4 2 2 4 1 1 1 1 1 2 2 3 1 4 3 1 4 1 4 1
## 
## Within cluster sum of squares by cluster:
## [1] 22.932432 10.847959 12.848323  6.212482
##  (between_SS / total_SS =  68.0 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "clust_plot"   "silinfo"      "nbclust"     
## [13] "data"
hotspot$cluster <- factor(km.res$cluster)
ggmap(tormap) +
  geom_point(data = torclus, aes(x = Long, y = Lat, color = hotspot$cluster)) +
  theme_void() + coord_map()
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

(3) Clustering strategy III.

The following code documents our third clustering approach. Instead of aggregating the dataset by neighbourhood, we aggregate it by the police division each event belongs to.

#coerce data for Divisions
bygroup <- group_by(data, MCI, Division)
groups <- dplyr::summarise(bygroup, n=n())
groups <- groups[c("Division", "MCI", "n")]
div <- as.data.frame(spread(groups, key=MCI, value=n))
div_name = div[,1]
div <- div[, -1]

#normalize
for(col in names(div)) {
  div[,col] = (div[,col] - mean(div[,col])) / sd(div[,col])
}

#determine number of clusters
#we can see there's an elbow around 3 clusters
wssplot(div, nc=15)

# k-means
km.res <- eclust(div, "kmeans", k = 3, graph = T)

fviz_silhouette(km.res, palette = "jco", ggtheme = theme_classic())
##   cluster size ave.sil.width
## 1       1    8          0.46
## 2       2    9          0.21
## 3       3   17          0.96

km.res
## K-means clustering with 3 clusters of sizes 8, 9, 17
## 
## Cluster means:
##      Assault Auto_Theft Break_and_Enter    Robbery Theft_Over
## 1  0.4761876  0.1302555       0.5495307  0.3599759  0.5014688
## 2  1.2770657  1.2633142       1.1990178  1.3373206  1.1973892
## 3 -0.9001819 -0.7301101      -0.8933768 -0.8773937 -0.8698972
## 
## Clustering vector:
##  [1] 1 3 1 3 1 3 2 3 2 3 2 3 2 3 2 3 1 3 2 3 2 3 2 3 2 3 1 3 1 3 1 3 1 3
## 
## Within cluster sum of squares by cluster:
## [1]  6.9416751 17.1183040  0.1890451
##  (between_SS / total_SS =  85.3 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "clust_plot"   "silinfo"      "nbclust"     
## [13] "data"

Similar to the neighborhood-based aggregation, one cluster has low crime incidents (cluster 3), while clusters 1 and 2 have moderately and substantially higher incidents respectively. Most divisions (17) are low-crime, while 9 divisions fall into the highest-crime cluster. However, judging from both the principal-component visualization of the clustering and the silhouette plot, police-division-based aggregation clearly works better than neighborhood-based aggregation.

(3) Clustering strategy IV.

So far we have aggregated crime statistics by either neighbourhood or division. What if we run clustering on data that excludes geographical information? Does such clustering produce naturally occurring crime zones? We will find out in the following analysis.

library(clustMixType)
data_fullclust = data[,c("occurrencetime","reportedtime","premisetype","offence","MCI")]
data_fullclust$occurrencetime = as.numeric(data_fullclust$occurrencetime)
data_fullclust$reportedtime = as.numeric(data_fullclust$reportedtime)

However, one important distinction compared to the previous clustering analyses is that we now have mixed data types: numeric and categorical. To handle this, we first experimented with the daisy function and Gower's distance. However, Gower's distance requires a full pairwise dissimilarity matrix, which drastically increases memory consumption on a large dataset and made it impractical to run on our own computers. We therefore switched to another algorithm, k-prototypes clustering, which handles mixed-type data directly. As with k-means, we determine the number of clusters from the trend in the within-cluster sum of squares.
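For completeness, the abandoned Gower's-distance route looked roughly like the sketch below, run on a small subsample; `daisy` builds a full n-by-n dissimilarity matrix, which is what exhausts memory at full scale:

```r
library(cluster)  # daisy() and pam()

# Gower's distance handles mixed numeric/categorical columns directly,
# but only on a subsample: the full dissimilarity matrix is too large
small <- data_fullclust[sample(nrow(data_fullclust), 1000), ]
gower_d <- daisy(small, metric = "gower")

# PAM (k-medoids) accepts a precomputed dissimilarity matrix
pam_fit <- pam(gower_d, diss = TRUE, k = 4)
table(pam_fit$clustering)
```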

library(rlist)
set.seed(1234)
kproto_clusters = list()
for (i in seq(2,8,1)) {
  kproto_clusters = list.append(kproto_clusters,kproto(data_fullclust,i,lambda = 1))
}
## # NAs in variables:
## occurrencetime   reportedtime    premisetype        offence            MCI 
##              0              0              0              0              0 
## 0 observation(s) with NAs.
## (the same zero-NA check is printed for each of the remaining six values of k)
wss = c()
for(kc in kproto_clusters) {
  wss = c(wss,kc$tot.withinss)
}
plot(seq(2,8,1),wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")

We see here that beyond 4 clusters (or perhaps 7) there are diminishing returns in reducing the within-cluster sum of squares. We choose 4 clusters and plot the clustering results on a map:

kproto_selection = kproto_clusters[[3]]  # index 3 corresponds to k = 4, since the list starts at k = 2
data$cluster_id = as.factor(kproto_selection$cluster)
to_map <- data.frame(data$MCI, data$Lat, data$Long, data$cluster_id)
colnames(to_map) <- c('crimes', 'lat', 'lon', 'cluster_id')
sbbox <- make_bbox(lon = data$Long, lat = data$Lat, f = 0.05)
my_map <- get_map(location = sbbox, maptype = "roadmap", scale = 2, color="color", zoom = 10)
## Warning: bounding box given to google - spatial extent only approximate.
## converting bounding box to center/zoom specification. (experimental)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=43.717524,-79.379169&zoom=10&size=640x640&scale=2&maptype=roadmap&language=en-EN
ggmap(my_map) +
  geom_point(data=to_map, aes(x = lon, y = lat, color = cluster_id), 
             size = 0.5, alpha = 1) +
  xlab('Longitude') +
  ylab('Latitude') +
  ggtitle('Crime events by k-prototypes cluster')

By visual inspection of the map alone, it is hard to see any patterns. What if we inspect the statistics within each cluster?

kproto_results <- data %>%
  dplyr::select(-X,-occurrencedate,-reporteddate,-occurrencetime,-reportedtime, -Lat, -Long) %>%
  group_by(cluster_id) %>%
  do(the_summary = summary(.))
print(kproto_results$the_summary)
## [[1]]
##      premisetype                      offence      reportedyear
##  Apartment : 9282   Assault               :15231   2014:    0  
##  Commercial: 7296   B&E                   : 7704   2015:20154  
##  House     : 7732   Theft_Of_Motor_Vehicle: 4006   2016:18518  
##  Other     : 3958   Assault_With_Weapon   : 2663   2017:   79  
##  Outside   :10483   Robbery_-_Mugging     : 1333               
##                     B&E_W'Intent          : 1093               
##                     (Other)               : 6721               
##  reportedmonth    reportedday    reporteddayofyear reporteddayofweek
##  7      : 4931   27     : 1388   179    :  194     1:5427           
##  6      : 4872   16     : 1356   226    :  194     2:5728           
##  5      : 4801   22     : 1355   165    :  192     3:5560           
##  8      : 4613   20     : 1318   178    :  192     4:5456           
##  4      : 2905   25     : 1316   180    :  186     5:5497           
##  9      : 2541   18     : 1302   240    :  184     6:5676           
##  (Other):14088   (Other):30716   (Other):37609     7:5407           
##   reportedhour   occurrenceyear  occurrencemonth occurrenceday  
##  15     : 2100   2015   :20527   7      : 4935   1      : 1676  
##  17     : 2056   2016   :18159   6      : 4892   27     : 1336  
##  18     : 2036   2014   :   61   5      : 4864   26     : 1318  
##  16     : 2011   2013   :    4   8      : 4510   20     : 1316  
##  13     : 1993   1966   :    0   4      : 2995   15     : 1312  
##  20     : 1959   1968   :    0   10     : 2519   25     : 1312  
##  (Other):26596   (Other):    0   (Other):14036   (Other):30481  
##  occurrencedayofyear occurrencedayofweek occurrencehour 
##  152    :  199       1:5524              0      : 2465  
##  178    :  197       2:5371              12     : 2223  
##  121    :  191       3:5264              20     : 2070  
##  182    :  188       4:5347              21     : 2052  
##  177    :  187       5:5496              22     : 2045  
##  150    :  185       6:6031              18     : 2028  
##  (Other):37604       7:5718              (Other):25868  
##               MCI           Division        Hood_ID     
##  Assault        :20545   D41    : 3045   75     : 1336  
##  Auto_Theft     : 4006   D32    : 2884   77     : 1261  
##  Break_and_Enter: 9042   D51    : 2870   1      : 1006  
##  Robbery        : 3788   D43    : 2818   73     :  819  
##  Theft_Over     : 1370   D14    : 2809   76     :  811  
##                          D31    : 2705   27     :  781  
##                          (Other):21620   (Other):32737  
##                                 Neighbourhood   cluster_id
##  Church-Yonge Corridor (75)            : 1336   1:38751   
##  Waterfront Communities-The Island (77): 1261   2:    0   
##  West Humber-Clairville (1)            : 1006   3:    0   
##  Moss Park (73)                        :  819   4:    0   
##  Bay Street Corridor (76)              :  811             
##  York University Heights (27)          :  781             
##  (Other)                               :32737             
## 
## [[2]]
##      premisetype                        offence   reportedyear
##  Apartment :28   Assault                    :43   2014:33     
##  Commercial: 4   Assault_With_Weapon        :12   2015:20     
##  House     :31   Theft_Over                 : 7   2016:13     
##  Other     : 7   Theft_Of_Motor_Vehicle     : 4   2017: 8     
##  Outside   : 4   B&E                        : 2               
##                  Administering_Noxious_Thing: 1               
##                  (Other)                    : 5               
##  reportedmonth  reportedday reporteddayofyear reporteddayofweek
##  1      :10    6      : 6   34     : 2        1: 5             
##  3      : 9    12     : 4   69     : 2        2:15             
##  4      : 9    19     : 4   110    : 2        3:18             
##  10     : 9    22     : 4   237    : 2        4:13             
##  6      : 7    24     : 4   288    : 2        5: 9             
##  2      : 6    2      : 3   309    : 2        6:11             
##  (Other):24    (Other):49   (Other):62        7: 3             
##   reportedhour occurrenceyear occurrencemonth occurrenceday
##  12     : 9    2000   :12     1      :45      1      :56   
##  11     : 8    2004   : 8     10     : 5      13     : 2   
##  10     : 6    2002   : 7     4      : 4      20     : 2   
##  14     : 6    2003   : 6     9      : 4      23     : 2   
##  19     : 6    2001   : 5     3      : 3      24     : 2   
##  13     : 5    1996   : 4     6      : 3      30     : 2   
##  (Other):34    (Other):32     (Other):10      (Other): 8   
##  occurrencedayofyear occurrencedayofweek occurrencehour
##  1      :44          1: 8                0      :42    
##  182    : 2          2:16                12     :12    
##  244    : 2          3: 8                8      : 3    
##  275    : 2          4: 7                16     : 3    
##  31     : 1          5:14                1      : 2    
##  32     : 1          6: 9                9      : 2    
##  (Other):22          7:12                (Other):10    
##               MCI        Division     Hood_ID  
##  Assault        :58   D54    : 8   60     : 4  
##  Auto_Theft     : 4   D42    : 7   131    : 4  
##  Break_and_Enter: 2   D53    : 7   2      : 3  
##  Robbery        : 2   D14    : 6   55     : 3  
##  Theft_Over     : 8   D31    : 5   23     : 2  
##                       D32    : 5   25     : 2  
##                       (Other):36   (Other):56  
##                                Neighbourhood cluster_id
##  Rouge (131)                          : 4    1: 0      
##  Woodbine-Lumsden (60)                : 4    2:74      
##  Mount Olive-Silverstone-Jamestown (2): 3    3: 0      
##  Thorncliffe Park (55)                : 3    4: 0      
##  Dorset Park (126)                    : 2              
##  Glenfield-Jane Heights (25)          : 2              
##  (Other)                              :56              
## 
## [[3]]
##      premisetype                     offence      reportedyear
##  Apartment :8953   Assault               :13461   2014:27854  
##  Commercial:6689   B&E                   : 7364   2015: 7921  
##  House     :7359   Theft_Of_Motor_Vehicle: 4134   2016:   61  
##  Other     :3892   Assault_With_Weapon   : 2332   2017:   30  
##  Outside   :8973   B&E_W'Intent          : 1386               
##                    Robbery_-_Mugging     : 1303               
##                    (Other)               : 5886               
##  reportedmonth    reportedday    reporteddayofyear reporteddayofweek
##  3      : 4335   10     : 1286   1      :  192     1:4975           
##  1      : 4222   19     : 1254   100    :  178     2:5237           
##  4      : 3883   17     : 1250   93     :  171     3:5208           
##  2      : 3774   1      : 1244   97     :  166     4:5170           
##  5      : 2563   3      : 1243   41     :  162     5:5142           
##  10     : 2533   9      : 1238   49     :  160     6:5176           
##  (Other):14556   (Other):28351   (Other):34837     7:4958           
##   reportedhour   occurrenceyear  occurrencemonth occurrenceday  
##  15     : 2038   2014   :27733   1      : 4317   1      : 1939  
##  14     : 2019   2015   : 7438   3      : 4261   18     : 1260  
##  18     : 1989   2013   :  435   4      : 3764   15     : 1229  
##  16     : 1973   2012   :  102   2      : 3760   17     : 1227  
##  17     : 1924   2011   :   55   10     : 2563   5      : 1222  
##  13     : 1827   2010   :   44   5      : 2523   14     : 1220  
##  (Other):24096   (Other):   59   (Other):14678   (Other):27769  
##  occurrencedayofyear occurrencedayofweek occurrencehour 
##  1      :  393       1:5009              0      : 2466  
##  91     :  197       2:4905              12     : 2102  
##  60     :  176       3:5013              18     : 1959  
##  31     :  172       4:5024              21     : 1877  
##  32     :  169       5:5083              20     : 1808  
##  92     :  168       6:5439              17     : 1799  
##  (Other):34591       7:5393              (Other):23855  
##               MCI           Division        Hood_ID     
##  Assault        :17978   D41    : 2909   75     : 1100  
##  Auto_Theft     : 4134   D43    : 2834   77     : 1049  
##  Break_and_Enter: 8960   D32    : 2725   1      :  931  
##  Robbery        : 3525   D51    : 2531   73     :  690  
##  Theft_Over     : 1269   D14    : 2499   27     :  689  
##                          D31    : 2472   137    :  684  
##                          (Other):19896   (Other):30723  
##                                 Neighbourhood   cluster_id
##  Church-Yonge Corridor (75)            : 1100   1:    0   
##  Waterfront Communities-The Island (77): 1049   2:    0   
##  West Humber-Clairville (1)            :  931   3:35866   
##  Moss Park (73)                        :  690   4:    0   
##  York University Heights (27)          :  689             
##  Woburn (137)                          :  684             
##  (Other)                               :30723             
## 
## [[4]]
##      premisetype                      offence      reportedyear
##  Apartment : 9358   Assault               :14932   2014:    0  
##  Commercial: 7786   B&E                   : 7618   2015:    0  
##  House     : 7406   Theft_Of_Motor_Vehicle: 4244   2016: 9686  
##  Other     : 4539   Assault_With_Weapon   : 3235   2017:29531  
##  Outside   :10128   Robbery_-_Mugging     : 1281               
##                     B&E_W'Intent          : 1052               
##                     (Other)               : 6855               
##  reportedmonth    reportedday    reporteddayofyear reporteddayofweek
##  10     : 5118   18     : 1404   333    :  204     1:5620           
##  11     : 5114   23     : 1353   304    :  201     2:5781           
##  9      : 4853   28     : 1342   323    :  201     3:5528           
##  12     : 4645   24     : 1339   293    :  195     4:5581           
##  8      : 2795   15     : 1335   316    :  190     5:5495           
##  7      : 2721   13     : 1333   338    :  190     6:5698           
##  (Other):13971   (Other):31111   (Other):38036     7:5514           
##   reportedhour   occurrenceyear  occurrencemonth occurrenceday  
##  15     : 2137   2017   :29211   10     : 5099   1      : 1551  
##  16     : 2092   2016   : 9996   11     : 5054   18     : 1395  
##  18     : 2052   2015   :   10   9      : 4896   22     : 1355  
##  19     : 2050   1966   :    0   12     : 4549   24     : 1354  
##  12     : 2007   1968   :    0   8      : 2876   2      : 1329  
##  20     : 2006   1970   :    0   7      : 2740   15     : 1325  
##  (Other):26873   (Other):    0   (Other):14003   (Other):30908  
##  occurrencedayofyear occurrencedayofweek occurrencehour 
##  323    :  200       1:5780              0      : 2358  
##  259    :  194       2:5385              12     : 2203  
##  293    :  194       3:5281              21     : 2125  
##  289    :  193       4:5429              18     : 2122  
##  304    :  193       5:5448              22     : 2089  
##  328    :  190       6:5909              20     : 2080  
##  (Other):38053       7:5985              (Other):26240  
##               MCI           Division        Hood_ID     
##  Assault        :20665   D32    : 2877   75     : 1544  
##  Auto_Theft     : 4244   D51    : 2761   77     : 1375  
##  Break_and_Enter: 8900   D14    : 2684   1      : 1010  
##  Robbery        : 3971   D41    : 2674   76     :  888  
##  Theft_Over     : 1437   D31    : 2552   73     :  796  
##                          D43    : 2481   27     :  756  
##                          (Other):23188   (Other):32848  
##                                 Neighbourhood   cluster_id
##  Church-Yonge Corridor (75)            : 1544   1:    0   
##  Waterfront Communities-The Island (77): 1375   2:    0   
##  West Humber-Clairville (1)            : 1010   3:    0   
##  Bay Street Corridor (76)              :  888   4:39217   
##  Moss Park (73)                        :  796             
##  York University Heights (27)          :  756             
##  (Other)                               :32848

Again, it is not entirely obvious how the clusters differ from one another. However, we do spot an interesting property of the second cluster. Previously we observed that some crimes are reported long after they occurred. The second cluster seems to isolate exactly these reports (notice the gap between occurrenceyear and reportedyear in this cluster).
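To check this hunch, one could compute the reporting delay per cluster directly (a sketch; it assumes the year columns are factors, as the summaries above suggest):

```r
# Years appear as factors, so convert via character before subtracting
delay <- as.numeric(as.character(data$reportedyear)) -
  as.numeric(as.character(data$occurrenceyear))

# Distribution of the reporting delay within each k-prototypes cluster;
# cluster 2 should show a much larger median delay than the others
tapply(delay, data$cluster_id, summary)
```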

VI. Evaluation

Throughout the clustering analysis in part V, we evaluated clustering quality by plotting visual representations of the clusters and their silhouette plots. In summary, both police-division-based and neighbourhood-based aggregation produced clusterings with a recognizable pattern, with division-based aggregation yielding the higher-quality clustering. The other two clustering strategies, on the other hand, were not as informative, at least not on immediate inspection.

VII. Deployment

An R Shiny app was created (please run server.R to see the app). The app should interest anyone eager to know the crime levels and statistics in Toronto, for example those looking to buy or rent property. It can be readily deployed to interested readers without much concern about scaling, since the dataset (i.e. reported crimes) is relatively limited in size unless richer dimensions are introduced to the data collecting activities.
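The actual server.R is richer, but the app's skeleton is roughly the following (a sketch; the input names are assumptions, and `data` and `my_map` are the objects built earlier in this report):

```r
library(shiny)

# Minimal UI: pick an MCI, see the matching events on the Toronto map
ui <- fluidPage(
  titlePanel("Toronto Major Crime Indicators"),
  selectInput("mci", "Major Crime Indicator",
              choices = sort(unique(data$MCI))),
  plotOutput("crime_map")
)

server <- function(input, output) {
  output$crime_map <- renderPlot({
    subset_data <- data[data$MCI == input$mci, ]
    ggmap(my_map) +
      geom_point(data = subset_data, aes(x = Long, y = Lat),
                 size = 0.5, alpha = 0.3)
  })
}

# shinyApp(ui, server)  # launched via server.R in the repository
```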

The clusters themselves can be used for sending police to high crime areas. As more crime happens, the clusters can be updated to better understand where crime is happening in Toronto, and to see if increased police presence in certain areas helps to reduce crime.
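Keeping the clusters current need not mean refitting from scratch each time: clustMixType provides a predict method for kproto objects, so newly reported events can be assigned to the existing clusters (a sketch; `new_events` stands in for freshly collected rows with the same five columns):

```r
# Assign new events to the fitted k-prototypes clusters without refitting
new_events <- data_fullclust[1:5, ]  # placeholder for newly collected records
predict(kproto_selection, new_events)$cluster
```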

As with other data products, some maintenance is expected to keep the data and analysis current. The following steps can be taken: (1) Collect data on an ongoing basis;
(2) Collect more features and experiment if additional features could further drive up clustering quality, or prediction accuracy in case that the clustering is used as a preprocessing step for predictive analysis;
(3) Establish a way to evaluate the source of data inconsistencies during data collection activities;
(4) Rerun modeling in a timely manner.


VIII. Conclusions

From our understanding of the data, the clustering activities are most useful as: (1) a preprocessing step for predictive analysis, reducing the dimensionality of the data and alleviating the “curse of dimensionality”; (2) an exploratory tool to help us understand the data.

For this specific task on Toronto’s crime statistics, one can readily recognize clusters based on how many offenses happened in each neighborhood or in each police division. Simple clustering algorithms such as k-means or hierarchical clustering help with recognizing and visualizing such clusters. They are an interesting way to visualize crime that happened in Toronto. Using this historical data, the Toronto Police Service can better allocate their resources to areas that have more crime, whether by neighbourhood, division, or natural hotspots.

Some areas for improvement include normalizing crime incidents by the population of each neighbourhood or division: an area with a higher population might record more incidents simply by virtue of having more residents, and conversely it would be interesting to know whether an area has low crime despite a high population. Another improvement would be to examine the demographics of the areas; this would let us check whether there is any bias in how crime occurrences are reported, and adjust the data for that bias.
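A first cut at the population adjustment could look like this (a sketch; `hood_pop` is a hypothetical lookup table with `Hood_ID` and `population` columns, e.g. from the city's neighbourhood census profiles):

```r
# Crimes per 1,000 residents by neighbourhood (hood_pop is hypothetical)
hood_counts <- dplyr::count(data, Hood_ID)          # columns: Hood_ID, n
per_capita <- merge(hood_counts, hood_pop, by = "Hood_ID")
per_capita$rate_per_1000 <- 1000 * per_capita$n / per_capita$population

# Neighbourhoods ranked by per-capita rate rather than raw counts
head(per_capita[order(-per_capita$rate_per_1000), ])
```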